 chronicling america


Model Details

Neural Information Processing Systems

We decreased the confidence threshold to 0.1 to increase article and headline ... The following specifications were used: { resolution: 256, learning rate: 2e-3 }. ... This limit is binding for common words, e.g., "the". ... The recognizer is trained using the Supervised Contrastive ("SupCon") loss function [7], a gener... In particular, we work with the "outside" SupCon loss formulation ... We use a MobileNetV3 (Small) encoder pre-trained on ImageNet1k sourced from the timm [19] ... We use 0.1 as the temperature for ... Center Cropping, to avoid destroying too much information. ... C (Small) model that is developed in [2] for character recognition. ... If multiple article bounding boxes satisfy these rules for a given headline, then we take the highest.
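The excerpt mentions the "outside" SupCon loss with a temperature of 0.1. As a rough illustration only, the sketch below computes that loss in NumPy for a small batch of embeddings. The function name `supcon_loss_out`, the L2 normalization step, and averaging over anchors (rather than summing) are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def supcon_loss_out(embeddings, labels, temperature=0.1):
    """Sketch of the "outside" supervised contrastive loss.

    For each anchor i, the log-probabilities of its positives (same label,
    not i itself) are averaged outside the log, then negated. Anchors with
    no positives are skipped. Averaging over anchors is an assumption.
    """
    z = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    n = z.shape[0]

    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        return 0.0  # no anchor has a positive, so there is nothing to contrast

    # Cosine similarities scaled by the temperature.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    # Log of the denominator: log-sum-exp over all samples except the anchor.
    masked = np.where(not_self, logits, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom

    # Mean log-probability over each anchor's positives, negated.
    pos_sum = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_anchor = -(pos_sum[has_pos] / n_pos[has_pos])
    return float(per_anchor.mean())
```

With embeddings that cluster by label, the loss is close to zero; assigning labels that cut across the clusters makes it much larger.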




ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

Piryani, Bhawna, Mozafari, Jamshid, Jatowt, Adam

arXiv.org Artificial Intelligence

Question answering (QA) and Machine Reading Comprehension (MRC) tasks have significantly advanced in recent years due to the rapid development of deep learning techniques and, more recently, large language models. At the same time, many benchmark datasets have become available for QA and MRC tasks. However, most existing large-scale benchmark datasets have been created predominantly using synchronous document collections like Wikipedia or the Web. Archival document collections, such as historical newspapers, contain valuable information from the past that is still not widely used to train large language models. To further contribute to advancing QA and MRC tasks and to overcome the limitation of previous datasets, we introduce ChroniclingAmericaQA, a large-scale temporal QA dataset with 487K question-answer pairs created based on the historical newspaper collection Chronicling America. Our dataset is constructed from a subset of the Chronicling America newspaper collection spanning 120 years. One of the significant challenges for utilizing digitized historical newspaper collections is the low quality of OCR text. Therefore, to enable realistic testing of QA models, our dataset can be used in three different ways: answering questions from raw and noisy content, answering questions from a cleaner, corrected version of the content, and answering questions from scanned images of newspaper pages. This and the fact that ChroniclingAmericaQA spans the longest time period among available QA datasets make it quite a unique and useful resource.


American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

Dell, Melissa, Carlson, Jacob, Bryan, Tom, Silcock, Emily, Arora, Abhishek, Shen, Zejiang, D'Amico-Wong, Luca, Le, Quan, Querubin, Pablo, Heldring, Leander

arXiv.org Artificial Intelligence

Existing full text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel, deep learning pipeline for extracting full article texts from newspaper images and applies it to the nearly 20 million scans in the Library of Congress's public domain Chronicling America collection. The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. To achieve high scalability, it is built with efficient architectures designed for mobile phones. The resulting American Stories dataset provides high quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge. The dataset could also be added to the external database of a retrieval-augmented language model to make historical information - ranging from interpretations of political events to minutiae about the lives of people's ancestors - more widely accessible. Furthermore, structured article texts facilitate using transformer-based methods for popular social science applications like topic classification, detection of reproduced content, and news story clustering. Finally, American Stories provides a massive silver quality dataset for innovating multimodal layout analysis models and other multimodal applications.


How To Search Historical Newspaper Images Using Artificial Intelligence

#artificialintelligence

Teachers and students (or anyone else in the public, for that matter) can now explore more than 1.5 million historical newspaper images online using artificial intelligence. The latest machine learning experience from LC Labs, Newspaper Navigator allows users to search visual content in American newspapers dating from 1789-1963. The user begins by entering a keyword that returns a selection of photos. Then the user can choose photos to search against, allowing the discovery of related images that were previously undetectable by search engines. For decades, partners across the United States have collaborated to digitize newspapers through the Library's Chronicling America website, a database of historical U.S. newspapers.


Newspaper Navigator

University of Washington Computer Science

Welcome to the Newspaper Navigator dataset! This dataset consists of extracted visual content for 16,358,041 historic newspaper pages in Chronicling America. The visual content was identified using an object detection model trained on annotations of World War 1-era Chronicling America pages, including annotations made by volunteers as part of the Beyond Words crowdsourcing project. The dataset also includes text corresponding to the visual content, identified by extracting the Optical Character Recognition, or OCR, within each predicted bounding box. For example, if the visual content recognition model predicted a bounding box around a headline, the corresponding textual content provides a machine-readable version of the headline; likewise, for a photograph, illustration, or map, this textual representation often contains the title and caption.
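The idea of extracting the OCR within a predicted bounding box can be sketched as follows. This is a minimal illustration under stated assumptions: the `Word` record, the `text_in_box` helper, and the rule that a word belongs to a box when its center falls inside it are all hypothetical and are not taken from the Newspaper Navigator code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One OCR word with its pixel bounding box (hypothetical record)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def text_in_box(words: List[Word], box: Tuple[float, float, float, float]) -> str:
    """Join, in input order, the OCR words whose centers lie inside `box`.

    `box` is (x0, y0, x1, y1). Center containment is an assumed rule;
    an overlap threshold would be another reasonable choice.
    """
    bx0, by0, bx1, by1 = box
    kept = []
    for w in words:
        cx = (w.x0 + w.x1) / 2.0
        cy = (w.y0 + w.y1) / 2.0
        if bx0 <= cx <= bx1 and by0 <= cy <= by1:
            kept.append(w.text)
    return " ".join(kept)


# Made-up page fragment: two headline words and one body word lower down.
page_words = [
    Word("PEACE", 10, 10, 90, 30),
    Word("DECLARED", 100, 10, 220, 30),
    Word("Washington", 10, 60, 110, 75),
]
headline = text_in_box(page_words, (0, 0, 300, 40))
```

Here `headline` holds the two words that sit inside the headline box, while the body word further down the page is left out.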